Data Visualization

As with all new datasets, let's start by familiarizing ourselves with the dataset:

Try it! Print the shape, columns, and show a sample observation
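A minimal sketch of that first look, assuming the data is loaded into a pandas DataFrame. The tiny `df` below is a made-up stand-in (the column names `neighbourhood_group`, `room_type`, and `price` are assumptions about the real file), so swap in your own `pd.read_csv(...)` call.

```python
import pandas as pd

# Hypothetical stand-in for the listings data; values are invented.
df = pd.DataFrame({
    "neighbourhood_group": ["Downtown", "Capitol Hill", "Central Area"],
    "room_type": ["Entire home/apt", "Hotel room", "Entire home/apt"],
    "price": [150, 95, 70],
})

print(df.shape)             # (number of rows, number of columns)
print(df.columns.tolist())  # column names
print(df.iloc[0])           # one sample observation (the first row)
```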

Scatter, Bars, and Histograms: The Basics

Some imports: Note that we'll rename plotly express as px.

Plotly Express is a "wrapper" for the base plotly package. What that means is we can use incredibly easy and readable functions, and Plotly Express will do the hard work of converting that input into formats that the software can understand.

Quick aside: If you're a web developer and love JS, or an academic and use R, the same Plotly library is available to use in both languages.

Our overarching goal: What are the average prices in each neighborhood?

In order to get a handle on the data, let's use a histogram to show the distribution of a single variable. How would you know what function to use? Google it!

There are a couple of outliers here skewing the distribution. Any idea on how to fix it?

It works, but doesn't really tell us too much. Let's modify the plot by adding some parameters.

With any Python package, we can pull up some quick documentation from Jupyter itself by appending ? to a name (for example, px.histogram?).
Try it! What parameters does px.histogram accept?

How does this differ by neighborhood? Use 'neighbourhood_group' as a breakout

Quick aesthetics

Plotly is interactive! Play around with the legends and plot area.
Double-click an entry in the legend on the right, and Plotly will automatically update the figure to show only those points.

Let's use some other colors. There are two main types: discrete sequences and continuous scales. If the data you're interested in has distinct groups (e.g. neighborhoods), use color_discrete_sequence=. If it is continuous instead (e.g. price), use color_continuous_scale=.

How do you know what options are available? Plotly has several default options for each type. You can check them out using px.colors.qualitative.swatches() for discrete options or px.colors.sequential.swatches() for continuous scales.

As a reminder, the color= parameter only splits the graph into different colors based on the attribute (column) given. To change the colors themselves, we need to pass a set of colors to another parameter.

We can change our colors fairly easily using color scales.


If the feature we pass to color= is discrete or categorical, we'll add the color_discrete_sequence param


If the feature is instead continuous, we'll use the color_continuous_scale param instead


Open the docs, and try out your favorite below:

Finally, we can add labels to our charts as dictionaries, in the form of {'column_name':'Column Name', 'another_name':'Another Name'}

Say we want to adjust this plot to show relative values. That is, we want to better highlight the price distributions of hotel rooms, even though they occur far less often than entire homes/apts.

Using GroupBys for Aggregation

Now, let's drop the columns that make no sense to have a median of.

Oftentimes we'll want to create visualizations at some aggregate level.

For example, let's say we want to show neighborhoods with a high median rental price.
Our data is at a per-listing level, meaning that each individual row is its own listing, with its price.
To get data at the per-neighborhood level, we've got to roll up all the listing prices per neighborhood. In other words, we group the data by neighborhood, then find the median across all of those listings.
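The roll-up described above can be sketched like this, on a handful of invented listings:

```python
import pandas as pd

# Invented per-listing data: one row per listing.
df = pd.DataFrame({
    "neighbourhood_group": ["Downtown", "Downtown", "Downtown",
                            "Capitol Hill", "Capitol Hill",
                            "Central Area", "Central Area", "Central Area"],
    "price": [200, 150, 250, 100, 120, 80, 90, 70],
})

# Group listings by neighbourhood, take the median price of each group,
# then sort so the most expensive neighbourhood comes first.
median_prices = (
    df.groupby("neighbourhood_group", as_index=False)["price"]
      .median()
      .sort_values("price", ascending=False)
)
print(median_prices)
```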

In breakout groups, see if you can build a bar plot to show median prices in each neighbourhood group, and sort them in a meaningful way

Make it complete! Label axes, hover text, color, the whole nine yards.

Say my friend and I have a budget of $90 per night. Show which regions are ideal for this. How you want to do that is entirely up to you: draw a horizontal line or color the ideal regions differently, as long as it communicates which neighborhoods are generally cheaper.

Hint: To draw a line, use fig.add_hline() with corresponding parameters
Hint: To color bars according to some condition, first create a new column that describes if the value is below budget.

Tabular data comes in two formats: wide or long.

Wide form puts the core observational unit in its own row, while long-form data shows each possible data combination as its own row. In practice, wide data is more human-readable, while long-form data tends to lend itself better to visualization tasks.

Pandas gives us a set of functions to switch back and forth between the two formats, as needed.

Let's use the Zillow dataset to explore this further. Our overarching goal is to show price trends in Seattle neighborhoods.

What regions are covered in the zillow data set?

Let's subset our data to just Seattle

Moving from wide to long form

A quick bit of cleaning is necessary here. Right now, each row in zillow represents a unique region, and the Zillow Rental Index (ZRI) value for each month is given in its own column (113 months = 113 columns). For visualization, we'd like each region - month combination to be its own row.

Look at the Pandas cheatsheet to see what the relevant operation should be.

Pandas Cheatsheet

Let's use pd.melt() to move from wide to long, making sure to google the documentation along the way.

To make this more tangible, our goal is to convert our wide data in the form of:
RegionName | 2010-09 | 2010-10 | 2010-11 ... to the long form
RegionName | Date | ZRI.

Hint: RegionID through SizeRank are all ID variables. This means that they are unique to each observation, and should not be dropped or pivoted in the transformation.
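A sketch of the reshape on a tiny mock-up of the wide table. The ZRI values are invented, and only three ID columns are mocked here; the real file may have more between RegionID and SizeRank, which would all go into id_vars.

```python
import pandas as pd

# Tiny stand-in for the wide Zillow table; values are invented.
zillow = pd.DataFrame({
    "RegionID": [1, 2],
    "RegionName": ["Capitol Hill", "Belltown"],
    "SizeRank": [10, 20],
    "2010-09": [1500.0, 1700.0],
    "2010-10": [1510.0, 1720.0],
    "2010-11": [1525.0, 1715.0],
})

# ID variables stay as columns; every other column becomes a (Date, ZRI) row.
id_vars = ["RegionID", "RegionName", "SizeRank"]
zillow_long = pd.melt(zillow, id_vars=id_vars, var_name="Date", value_name="ZRI")

# Parse the month strings into real dates for plotting later.
zillow_long["Date"] = pd.to_datetime(zillow_long["Date"], format="%Y-%m")
print(zillow_long)
```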

Let's take a look at a particular region, say Capitol Hill.

Interpolating Data

Because of missing datapoints, we have to interpolate some values, that is, make a best guess based on the values before and after the gap.
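A minimal sketch using pandas' Series.interpolate on an invented series for one region. Linear interpolation is assumed here; it simply draws a straight line across the gap.

```python
import pandas as pd

# Invented long-form data for a single region, with one missing month.
cap_hill = pd.DataFrame({
    "Date": pd.to_datetime(["2010-09", "2010-10", "2010-11", "2010-12"],
                           format="%Y-%m"),
    "ZRI": [1500.0, None, 1540.0, 1560.0],
})

# Fill the gap with the value halfway between its neighbours.
cap_hill["ZRI"] = cap_hill["ZRI"].interpolate(method="linear")
print(cap_hill)
```

With several regions in one table, the interpolation should be applied per region (for example via a groupby on RegionName) so that values don't bleed from one neighborhood into the next.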

Making a Time Series Plot

Now that we've got data in a long format, we can easily use plotly express functions to create a time series. Say I'm interested in the following neighborhoods: 'Denny Triangle', 'First Hill', 'Capitol Hill', 'Belltown', 'Uptown'. Plot the time trends, and make it pretty.

Advanced Topics: Geographic Plots

There are quite a few different ways to show geographical data, usually with choropleth charts or scatter plots.
Our friend Plotly has them all: https://plotly.com/python/maps/

A quick note about how this works before letting you leaf through the docs page.


Most of the params in px.scatter_mapbox() behave pretty similarly to px.scatter, except that we provide latitude and longitude data instead of x and y. Luckily, our dataset already has that included, but oftentimes we'll have to find a lookup table online to convert city names, for example, to lat / lon coordinates.


We don't necessarily have to provide a value to size=, but that usually can help highlight points of interest.
zoom= on the other hand, just changes how zoomed in the initial picture is when first loaded.


Finally, we'll have to update the mapbox_style= parameter of the figure to a specific base map to load.
For more information on what options are available here, check out https://plotly.com/python/mapbox-layers/

Show places in Downtown, Central Area, and Capitol Hill, and highlight those under budget

Geocoding Addresses